Vaibhavi Mulay
Airbnb is a paid community platform for renting and booking private accommodation, founded in 2008. Airbnb allows individuals to rent out all or part of their own home as extra accommodation. The site offers a search and booking platform connecting people offering their accommodation with vacationers who wish to rent it. It covers more than 1.5 million listings in more than 34,000 cities and 191 countries. Between its creation in August 2008 and June 2012, more than 10 million nights were booked on Airbnb.
Since 2008, guests and hosts have used Airbnb to expand traveling possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service used and recognized around the world. Data analysis on the millions of listings provided through Airbnb is a crucial factor for the company. These millions of listings generate a lot of data - data that can be analyzed and used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
Goal: find the best prediction model for price, i.e. the relationship between the price and other factors.
Audience: travelers and hosts using Airbnb.
The dataset used here is the New York City Airbnb Open Data, available on Kaggle. It has 16 columns and 48,895 rows.
Below you will find the implementation of the processes we carried out for the analysis. You can jump to these sections:
1. Data Cleaning
2. Exploratory Data Analysis
3. Statistics and Machine Learning
First we will import libraries such as numpy, pandas, matplotlib and seaborn to manipulate, analyze and visualize our data. The second task in setting up is importing our dataset from a CSV file into our notebook, where the CSV file is converted into a pandas DataFrame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import seaborn as sns
import pandas_profiling
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
#using the pandas 'read_csv' function to read the Airbnb CSV file, already formatted for us by Kaggle
airbnb=pd.read_csv('AB_NYC_2019.csv')
#examining the head of the Airbnb dataframe
airbnb.head(10)
#profiling helps understanding the distribution of data
pandas_profiling.ProfileReport(airbnb)
The first step is cleaning our data. Here we will perform operations such as getting our data into a standard format, handling null values, removing unnecessary columns or values, etc.
airbnb.info()
total = airbnb.isnull().sum().sort_values(ascending=False)
percent = (airbnb.isnull().sum() * 100 / airbnb.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)
airbnb['adjusted_price'] = airbnb.price/airbnb.minimum_nights
airbnb.head()
airbnb["last_review"] = pd.to_datetime(airbnb.last_review)
airbnb.head()
airbnb["reviews_per_month"] = airbnb["reviews_per_month"].fillna(airbnb["reviews_per_month"].mean())
airbnb.head()
airbnb.last_review.fillna(method="ffill", inplace=True)
airbnb.head()
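The two imputation strategies above behave quite differently; a small sketch on a toy Series (illustrative values, not from the dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])
print(s.fillna(s.mean()).tolist())  # mean of [1, 3] is 2 -> [1.0, 2.0, 3.0, 2.0]
print(s.ffill().tolist())           # carry the last value forward -> [1.0, 1.0, 3.0, 3.0]
```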
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        print("=======================================================")
        print(f"{column} ==> Missing Values : {airbnb[column].isnull().sum()}, dtypes : {airbnb[column].dtypes}")
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        airbnb[column] = airbnb[column].fillna(airbnb[column].mode()[0])
airbnb.isnull().sum()
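Filling with mode()[0] puts the most frequent value of the column into each gap; a toy sketch:

```python
import pandas as pd

s = pd.Series(["Brooklyn", None, "Queens", "Brooklyn"])
filled = s.fillna(s.mode()[0])  # "Brooklyn" occurs most often
print(filled.tolist())  # ['Brooklyn', 'Brooklyn', 'Queens', 'Brooklyn']
```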
pd.options.display.float_format = "{:.2f}".format
airbnb.describe()
# Drop ["id", "host_name"] because they are insignificant for the analysis, and host_name also for ethical (privacy) reasons.
airbnb.drop(["id", "host_name"], axis="columns", inplace=True)
airbnb.head()
categorical_col = []
for column in airbnb.columns:
    if len(airbnb[column].unique()) <= 10:
        print("===============================================================================")
        print(f"{column} : {airbnb[column].unique()}")
        categorical_col.append(column)
Exploratory Data Analysis, or EDA, is an approach to analyzing a dataset to summarize its characteristics, often with visual methods. For the given dataset we have explored the attributes using appropriate graphical models. This will help us understand the nature of our data, its behavior and so on. In the sections below we will analyze our data and try to answer questions like why, where and how different factors affect Airbnb ratings and prices.
import plotly.graph_objs as go
#Access token from Plotly
mapbox_access_token = 'pk.eyJ1Ijoia3Jwb3BraW4iLCJhIjoiY2pzcXN1eDBuMGZrNjQ5cnp1bzViZWJidiJ9.ReBalb28P1FCTWhmYBnCtA'
#Prepare data for Plotly
data = [
go.Scattermapbox(
lat=airbnb.latitude,
lon=airbnb.longitude,
mode='markers',
text=airbnb[['neighbourhood_group','number_of_reviews','adjusted_price']],
marker=dict(
size=7,
color=airbnb.adjusted_price,
colorscale='RdBu',
reversescale=True,
colorbar=dict(
title='Adjusted Price'
)
),
)
]
#Prepare layout for Plotly
layout = go.Layout(
autosize=True,
hovermode='closest',
title='NYC Airbnb ',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=40.721319,
lon=-73.987130
),
pitch=0,
zoom=11
),
)
from plotly.offline import init_notebook_mode, iplot
#Create map using Plotly
fig = dict(data=data, layout=layout)
iplot(fig, filename='NYC Airbnb')
airbnb[airbnb.adjusted_price > 5000]
import plotly.express as px
## Setting up the Visualization..
fig = px.scatter_mapbox(airbnb,
hover_data = ['price','minimum_nights','room_type'],
hover_name = 'neighbourhood',
lat="latitude",
lon="longitude",
color="neighbourhood_group",
size="price",
# color_continuous_scale=px.colors.cyclical.IceFire,
size_max=30,
opacity = .70,
zoom=10,
)
# "open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner" or
# "stamen-watercolor" yield maps composed of raster tiles from various public tile servers which do
# not require signups or access tokens
# fig.update_layout(mapbox_style="carto-positron",
# )
fig.layout.mapbox.style = 'stamen-terrain'
fig.update_layout(title_text = 'Airbnb by Borough in NYC<br>(Click legend to toggle borough)', height = 800)
fig.show()
The first graph shows the relationship between price and room type. Shared room prices are always lower than 2,000 dollars. On the other hand, private rooms and entire homes reach the highest prices in some listings.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15,12))
sns.scatterplot(x='room_type', y='price', data=airbnb)
plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price",size=15, weight='bold')
The graph below shows price by room type, broken down by neighbourhood group. The highest prices for both Private room and Entire home/apt are in the same area, Manhattan. Brooklyn also has very high prices for both Private room and Entire home/apt. On the other hand, the highest shared-room prices are in Queens and Staten Island.
plt.figure(figsize=(20,15))
sns.scatterplot(x="room_type", y="price",
hue="neighbourhood_group", size="neighbourhood_group",
sizes=(50, 200), palette="Dark2", data=airbnb)
plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price vs Neighbourhood Group",size=15, weight='bold')
f,ax=plt.subplots(1,2,figsize=(18,8))
airbnb['neighbourhood_group'].value_counts().plot.pie(explode=[0,0.05,0,0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Share of Neighborhood')
ax[0].set_ylabel('Neighborhood Share')
sns.countplot('neighbourhood_group',data=airbnb,ax=ax[1],order=airbnb['neighbourhood_group'].value_counts().index)
ax[1].set_title('Share of Neighborhood')
plt.show()
plt.figure(figsize=(10,6))
sns.distplot(airbnb[airbnb.neighbourhood_group=='Manhattan'].adjusted_price,color='maroon',hist=False,label='Manhattan')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Brooklyn'].adjusted_price,color='black',hist=False,label='Brooklyn')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Queens'].adjusted_price,color='green',hist=False,label='Queens')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Staten Island'].adjusted_price,color='blue',hist=False,label='Staten Island')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Bronx'].adjusted_price,color='lavender',hist=False,label='Bronx')
plt.title('Borough-wise price distribution for adjusted_price < 1000')
plt.xlim(0,1000)
plt.show()
#we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization
#creating a sub-dataframe with no extreme values / less than 500
sub_6=airbnb[airbnb.adjusted_price < 500]
#using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=sub_6, x='neighbourhood_group', y='adjusted_price')
viz_2.set_title('Density and distribution of prices for each neighbourhood_group')
Great, with the statistical table and the violin plot we can observe a few things about the distribution of Airbnb prices across NYC boroughs. First, Manhattan has the highest range of listing prices, with an average of about 150 dollars per night, followed by Brooklyn at about 90 dollars per night. Queens and Staten Island appear to have very similar distributions, and the Bronx is the cheapest of them all. This distribution and density of prices were completely expected: it is no secret that Manhattan is one of the most expensive places in the world to live, while the Bronx has a lower cost of living.
from scipy.stats import norm
plt.figure(figsize=(10,10))
sns.distplot(airbnb['price'], fit=norm)
plt.title("Price Distribution Plot",size=15, weight='bold')
The distribution plot above shows that price has a right-skewed distribution, i.e. positive skewness. A log transformation will be used to make this feature less skewed, which allows easier interpretation and better statistical analysis.
Since log(0) is undefined, a log(1+x) transformation is better.
airbnb['price_log'] = np.log(airbnb.price+1)
With the help of the log transformation, the price feature now has an approximately normal distribution.
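Because the model is trained on log-price, predictions have to be mapped back to the dollar scale with the inverse transform. A minimal round-trip sketch (illustrative values, not from the dataset):

```python
import numpy as np

prices = np.array([0.0, 49.0, 150.0, 10000.0])
price_log = np.log1p(prices)       # np.log1p(x) == np.log(x + 1), safe at x = 0
recovered = np.expm1(price_log)    # inverse: np.expm1(x) == np.exp(x) - 1
print(np.allclose(recovered, prices))  # True
```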
plt.figure(figsize=(12,10))
sns.distplot(airbnb['price_log'], fit=norm)
plt.title("Log-Price Distribution Plot",size=15, weight='bold')
In the graph below, the good fit indicates that normality is a reasonable approximation.
from scipy import stats
plt.figure(figsize=(7,7))
stats.probplot(airbnb['price_log'], plot=plt)
plt.show()
airbnb['neighbourhood_group']= airbnb['neighbourhood_group'].astype("category").cat.codes
airbnb['neighbourhood'] = airbnb['neighbourhood'].astype("category").cat.codes
airbnb['room_type'] = airbnb['room_type'].astype("category").cat.codes
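The .astype("category").cat.codes encoding above replaces each label with an integer code (alphabetical by default); a toy sketch:

```python
import pandas as pd

s = pd.Series(["Brooklyn", "Manhattan", "Brooklyn", "Queens"])
codes = s.astype("category").cat.codes  # Brooklyn=0, Manhattan=1, Queens=2
print(codes.tolist())  # [0, 1, 0, 2]
```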
airbnb.info()
airbnb_model = airbnb.drop(columns=['name','host_id',
'last_review','price','adjusted_price'])
plt.figure(figsize=(15,12))
palette = sns.diverging_palette(20, 220, n=256)
corr=airbnb_model.corr(method='pearson')
sns.heatmap(corr, annot=True, fmt=".2f", cmap=palette, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5}).set(ylim=(11, 0))
plt.title("Correlation Matrix",size=15, weight='bold')
The correlation matrix shows that there is no strong relationship between price and the other features. This indicates that no feature needs to be taken out of the data.
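One convenient way to quantify this claim is to list each feature's absolute correlation with the target, strongest first. A sketch on synthetic columns (the real call would use the correlation matrix computed above on airbnb_model):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, 200).astype(float),
    "number_of_reviews": rng.integers(0, 300, 200).astype(float),
    "availability_365": rng.integers(0, 365, 200).astype(float),
    "price_log": rng.normal(5.0, 1.0, 200),
})
# absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (df.corr(method="pearson")["price_log"]
                      .drop("price_log").abs().sort_values(ascending=False))
print(corr_with_target)
```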
A residual plot is a strong method for detecting outliers and non-linearity in data used for regression models. The charts below show the residual plot of each feature against price (lowess=True draws a locally weighted regression line, shown in red).
In an ideal residual plot, the red line would be horizontal. Based on the charts below, most features are non-linear; on the other hand, there are not many outliers in any feature. This result leads to underfitting. Underfitting can occur when input features do not have a strong relationship to the target variable or when the model is over-regularized. To avoid underfitting, new features can be added or the regularization weight can be reduced.
In this kernel, since the input features cannot be extended, regularized linear models will be used and a polynomial transformation will be applied to avoid underfitting.
airbnb_model_x, airbnb_model_y = airbnb_model.iloc[:,:-1], airbnb_model.iloc[:,-1]
f, axes = plt.subplots(5, 2, figsize=(15, 20))
for i in range(10):
    sns.residplot(x=airbnb_model_x.iloc[:, i], y=airbnb_model_y, lowess=True,
                  ax=axes[i // 2, i % 2],
                  scatter_kws={'alpha': 0.5},
                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.setp(axes, yticks=[])
plt.tight_layout()
Multicollinearity measures the relationship between the explanatory variables in a multiple regression. If multicollinearity occurs, the highly related input variables should be eliminated from the model.
In this kernel, multicollinearity will be checked using the eigenvalues of the correlation matrix.
multicollinearity, V=np.linalg.eig(corr)
multicollinearity
None of the eigenvalues of the correlation matrix is close to zero, which means that no multicollinearity exists in the data.
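To see why a near-zero eigenvalue signals multicollinearity, here is a synthetic example in which one column is (almost exactly) the sum of the other two:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=1e-6, size=500)  # nearly a linear combination

corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
eigvals = np.linalg.eigvalsh(corr)  # ascending, real for a symmetric matrix
print(eigvals)  # the smallest eigenvalue is essentially zero -> multicollinearity
```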
First, the Standard Scaler technique will be used to normalize the data set, so that each feature has mean 0 and standard deviation 1.
scaler = StandardScaler()
airbnb_model_x = scaler.fit_transform(airbnb_model_x)
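A quick sanity check on toy data that the scaler does what the text claims:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```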
Secondly, the data will be split in a 70–30 ratio.
X_train, X_test, y_train, y_test = train_test_split(airbnb_model_x, airbnb_model_y, test_size=0.3,random_state=42)
Now it is time to build a feature importance graph. For this, the Extra Trees Classifier method will be used.
lab_enc = preprocessing.LabelEncoder()
feature_model = ExtraTreesClassifier(n_estimators=50)
feature_model.fit(X_train,lab_enc.fit_transform(y_train))
plt.figure(figsize=(7,7))
feat_importances = pd.Series(feature_model.feature_importances_, index=airbnb_model.iloc[:,:-1].columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
The graph above shows the feature importance for the dataset. According to it, neighbourhood group and room type have the lowest importance in the model. Based on this result, model building will be done in two phases: in the first phase, models will be built with all features, and in the second phase, models will be built without the neighbourhood group and room type features.
The correlation matrix, residual plots and multicollinearity results show that underfitting occurs in the model and that there is no multicollinearity among the independent variables. Underfitting will be addressed with a polynomial transformation, since no new features can be added or substituted for the existing ones.
In the model building section, Linear Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression models will be built. The regularized models avoid the pitfalls of plain Linear Regression and show the results with a little regularization.
First, the GridSearchCV algorithm will be used to find the best parameters and tune hyperparameters for each model. In this algorithm, 5-fold cross-validation and the mean squared error regression loss metric will be used.
def linear_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_LR = LinearRegression()
    parameters = {'fit_intercept': [True, False], 'normalize': [True, False], 'copy_X': [True, False]}
    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.
    grid_search_LR = GridSearchCV(estimator=model_LR,
                                  param_grid=parameters,
                                  scoring='neg_mean_squared_error',
                                  cv=cv,
                                  n_jobs=-1)
    ## Lastly, finding the best parameters.
    grid_search_LR.fit(input_x, input_y)
    best_parameters_LR = grid_search_LR.best_params_
    best_score_LR = grid_search_LR.best_score_
    print(best_parameters_LR)
    print(best_score_LR)

def ridge_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Ridge = Ridge()
    # prepare a range of alpha values to test
    alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
    normalizes = [True, False]
    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.
    grid_search_Ridge = GridSearchCV(estimator=model_Ridge,
                                     param_grid=dict(alpha=alphas, normalize=normalizes),
                                     scoring='neg_mean_squared_error',
                                     cv=cv,
                                     n_jobs=-1)
    ## Lastly, finding the best parameters.
    grid_search_Ridge.fit(input_x, input_y)
    best_parameters_Ridge = grid_search_Ridge.best_params_
    best_score_Ridge = grid_search_Ridge.best_score_
    print(best_parameters_Ridge)
    print(best_score_Ridge)

def lasso_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Lasso = Lasso()
    # prepare a range of alpha values to test
    alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
    normalizes = [True, False]
    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.
    grid_search_lasso = GridSearchCV(estimator=model_Lasso,
                                     param_grid=dict(alpha=alphas, normalize=normalizes),
                                     scoring='neg_mean_squared_error',
                                     cv=cv,
                                     n_jobs=-1)
    ## Lastly, finding the best parameters.
    grid_search_lasso.fit(input_x, input_y)
    best_parameters_lasso = grid_search_lasso.best_params_
    best_score_lasso = grid_search_lasso.best_score_
    print(best_parameters_lasso)
    print(best_score_lasso)

def elastic_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_grid_Elastic = ElasticNet()
    # prepare a range of alpha values to test
    alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
    normalizes = [True, False]
    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.
    grid_search_elastic = GridSearchCV(estimator=model_grid_Elastic,
                                       param_grid=dict(alpha=alphas, normalize=normalizes),
                                       scoring='neg_mean_squared_error',
                                       cv=cv,
                                       n_jobs=-1)
    ## Lastly, finding the best parameters.
    grid_search_elastic.fit(input_x, input_y)
    best_parameters_elastic = grid_search_elastic.best_params_
    best_score_elastic = grid_search_elastic.best_score_
    print(best_parameters_elastic)
    print(best_score_elastic)
Before model building, 5-Fold Cross Validation will be implemented for validation.
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)  # random_state requires shuffle=True
for train_index, test_index in kfold_cv.split(airbnb_model_x, airbnb_model_y):
    X_train, X_test = airbnb_model_x[train_index], airbnb_model_x[test_index]
    y_train, y_test = airbnb_model_y.iloc[train_index], airbnb_model_y.iloc[test_index]
The polynomial transformation will be made with degree 2 and interaction_only=True, which adds the pairwise products of the features (interaction terms) as new inputs.
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train = Poly.fit_transform(X_train)
X_test = Poly.transform(X_test)
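Note that with interaction_only=True the transform adds pairwise products rather than squares. A two-feature sketch makes the output explicit:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(X))  # [[2. 3. 6.]] -> x1, x2 and x1*x2, no x1**2 or x2**2
```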
##Linear Regression
lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)
#Ridge Model
ridge_model = Ridge(alpha = 0.01, normalize = True)
ridge_model.fit(X_train, y_train)
pred_ridge = ridge_model.predict(X_test)
#Lasso Model
Lasso_model = Lasso(alpha = 0.001, normalize =False)
Lasso_model.fit(X_train, y_train)
pred_Lasso = Lasso_model.predict(X_test)
#ElasticNet Model
model_enet = ElasticNet(alpha = 0.01, normalize=False)
model_enet.fit(X_train, y_train)
pred_test_enet= model_enet.predict(X_test)
All steps from Phase 1 will be repeated in this phase. The difference is that the neighbourhood_group and room_type features are eliminated.
airbnb_model_xx = airbnb_model.drop(columns=['neighbourhood_group', 'room_type'])
airbnb_model_xx, airbnb_model_yx = airbnb_model_xx.iloc[:,:-1], airbnb_model_xx.iloc[:,-1]
X_train_x, X_test_x, y_train_x, y_test_x = train_test_split(airbnb_model_xx, airbnb_model_yx, test_size=0.3,random_state=42)
scaler = StandardScaler()
airbnb_model_xx = scaler.fit_transform(airbnb_model_xx)
kfold_cv = KFold(n_splits=4, shuffle=True, random_state=42)  # random_state requires shuffle=True
for train_index, test_index in kfold_cv.split(airbnb_model_xx, airbnb_model_yx):
    X_train_x, X_test_x = airbnb_model_xx[train_index], airbnb_model_xx[test_index]
    y_train_x, y_test_x = airbnb_model_yx.iloc[train_index], airbnb_model_yx.iloc[test_index]
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_x = Poly.fit_transform(X_train_x)
X_test_x = Poly.transform(X_test_x)
###Linear Regression
lr_x = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr_x.fit(X_train_x, y_train_x)
lr_pred_x= lr_x.predict(X_test_x)
###Ridge
ridge_x = Ridge(alpha = 0.01, normalize = True)
ridge_x.fit(X_train_x, y_train_x)
pred_ridge_x = ridge_x.predict(X_test_x)
###Lasso
Lasso_x = Lasso(alpha = 0.001, normalize =False)
Lasso_x.fit(X_train_x, y_train_x)
pred_Lasso_x = Lasso_x.predict(X_test_x)
##ElasticNet
model_enet_x = ElasticNet(alpha = 0.01, normalize=False)
model_enet_x.fit(X_train_x, y_train_x)
pred_train_enet_x= model_enet_x.predict(X_train_x)
pred_test_enet_x= model_enet_x.predict(X_test_x)
In this part, three metrics (MAE, RMSE and R2) will be calculated to evaluate the predictions.
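The three metrics on a small hand-checkable example (illustrative numbers only, not model output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |errors| = 0.75
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of the mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
print(mae, rmse, r2)
```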
print('-------------Linear Regression-----------')
print('--Phase-1--')
print('MAE: %f'% mean_absolute_error(y_test, lr_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, lr_pred)))
print('R2 %f' % r2_score(y_test, lr_pred))
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(y_test_x, lr_pred_x))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, lr_pred_x)))
print('R2 %f' % r2_score(y_test_x, lr_pred_x))
print('---------------Ridge ---------------------')
print('--Phase-1--')
print('MAE: %f'% mean_absolute_error(y_test, pred_ridge))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, pred_ridge)))
print('R2 %f' % r2_score(y_test, pred_ridge))
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(y_test_x, pred_ridge_x))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, pred_ridge_x)))
print('R2 %f' % r2_score(y_test_x, pred_ridge_x))
print('---------------Lasso-----------------------')
print('--Phase-1--')
print('MAE: %f' % mean_absolute_error(y_test, pred_Lasso))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test, pred_Lasso)))
print('R2 %f' % r2_score(y_test, pred_Lasso))
print('--Phase-2--')
print('MAE: %f' % mean_absolute_error(y_test_x, pred_Lasso_x))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x, pred_Lasso_x)))
print('R2 %f' % r2_score(y_test_x, pred_Lasso_x))
print('---------------ElasticNet-------------------')
print('--Phase-1 --')
print('MAE: %f' % mean_absolute_error(y_test,pred_test_enet))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test,pred_test_enet)))
print('R2 %f' % r2_score(y_test, pred_test_enet))
print('--Phase-2--')
print('MAE: %f' % mean_absolute_error(y_test_x,pred_test_enet_x))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x,pred_test_enet_x)))
print('R2 %f' % r2_score(y_test_x, pred_test_enet_x))
The results show that all models have similar predictions, but Phase 1 and Phase 2 differ greatly on every metric. All error metrics increase in Phase 2, meaning the prediction error is higher in that phase and the model explains much less of the variability of the response around its mean.
from sklearn.model_selection import train_test_split, cross_val_score
def rmse_cv(model):
    # KFold(...).get_n_splits(...) simply returns the number of splits (5)
    kf = KFold(5, shuffle=True, random_state=91).get_n_splits(airbnb.price)
    # note: returns negative MSE scores; the caller negates them
    return cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=kf)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
best_random = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=30,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=1400,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False)
rfr_CV_best = -rmse_cv(best_random)
best_random.fit(X_train, y_train)
y_train_rfr = best_random.predict(X_train)
y_test_rfr = best_random.predict(X_test)
rfr_best_results = pd.DataFrame({'algorithm':['Random Forest Regressor'],
'CV error': rfr_CV_best.mean(),
'CV std': rfr_CV_best.std(),
'training error': [mean_squared_error(y_train, y_train_rfr)],
'test error': [mean_squared_error(y_test, y_test_rfr)],
'training_r2_score': [r2_score(y_train, y_train_rfr)],
'test_r2_score': [r2_score(y_test, y_test_rfr)]})
rfr_best_results
Summarizing our findings
This Airbnb ('AB_NYC_2019') dataset for 2019 proved to be a very rich dataset, with a variety of columns that allowed deep data exploration on each significant attribute.
By creating a map showing the adjusted price of every listing, we saw how pricing is distributed across New York. We also examined how listings are distributed by borough, how many listings each borough has, and how prices are distributed within each borough.
From this, we obtained the mean listing price for each borough, which can help customers avoid overpaying in a specific area and being misled by hosts.
One model was fitted to predict price from every feature, and another from all features except neighbourhood_group and room_type. We saw that price depends on each feature, as the Phase 1 errors were lower than those of Phase 2.
Also, the test R-squared was around 60%, which suggests the selected prediction model was good enough to predict the price.
Finally, ElasticNet was the best of the models tried for predicting the price.
For our data exploration purposes, it would also be nice to have a couple of additional features, such as numeric (0–5 star) positive and negative reviews or a 0–5 star average review for each listing; these would help determine the best-reviewed hosts in NYC, along with the 'number_of_reviews' column that is provided.
If customers provided ratings for every listing, a recommendation system could be built on top of them to help customers find the listings that best fit their needs.